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Abstract 

Complex systems of polynomial equations have to be set up and solved algebraically in 
order to obtain analytic solutions for maximum likelihood on phylogcnctic trees. This has 
restricted the types of systems previously resolved to the simplest models - three and four 
taxa under a molecular clock, with just two state characters. In this work we give, for the first 
time, analytic solutions for a family of trees with four state characters, like normal DNA or 
RNA. The model of substitution we use is the Jukes-Cantor model, and the trees are on three 
taxa under molecular clock, namely rooted triplets. 

We employ a number of approaches and tools to solve this system: Spectral methods 
(Hadamard conjugation), a new representation of variables (the path-set spectrum), and al- 
gebraic geometry tools (the resultant of two polynomials). All these, combined with heavy 
application of computer algebra packages (Maple), let us derive the desired solution. 



Key words: Maximum likelihood, phylogenetic trees, Jukes-Cantor, Hadamard conjugation, 
analytical solutions, symbolic algebra. 



1 Introduction 



Maxi mum likelihood (M L) is increasingly used as an optimality criterion for selecting evolutionary 
trees (|Felsensteinl . Il98lh . but finding the global optimum is a hard computational task, which led 
to using mostly numeric me t hods. So far, analytic solutions have been derived only for the 
simplest models (|Yangl . I2OO0I : Ichor et all boqiL [2 003) - three and four taxa under a molecular 
clock, with just two state characters (|NevmarJ . ll97lh . In this work we present, for the first time, 
analytic solutions for a general family of trees with four state charac ters, like normal DNA o r 
RNA. The model of substitution we use is the Jukes-Cantor model (| Jukes and Cantor . Il96fil ). 
where all substitutions have the same rate. The trees we deal with are three taxa ones, namely 
rooted triplets (see Figure ^). 
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Figure 1: A rooted tree over 3 species. 



The change from two to four character states incurs a major increase in the complexity of the 
underlying algebraic system. Like previous works, our starting point is to present the general 
maximum likelihood problem on phylogenetic trees as a constrained optimization problem. This 
gives rise to a complex system of polynomial equations, and the goal is to solve them. The 
complexity of this system prevents any manual solution, so we relied heavily on Maple, a symbolic 
mathematical system. However, even with Maple, a simple attempt to s olve the system (e.g. just 
typing solve) fails , and additional tools are required. Spectral analysis ([Hendv and Penny , 1993; 



Hendv et allll994h uses Hadamard conjugation to express the expected frequencies of site patterns 
among sequences as a function of an evolutionary tree T and a model of sequence evolution along 
the edges of T. As in previous works, we used the Hadamard conjugation as a basic building 
block in our method of solution. However, it turns out that the specific way we represent the 
system, which is dete rmined by the choice of variables, plays a crucial role in the ability to solve 
it. In previous works dChor et allkoOllfc ooa). the va riables in the polynomials were based on the 



expected sequence spectrum (Hen dv and Penny . 1993). This representation leads to a system with 
two polynomials of degrees 11 and 10. This system is too complex to resolve with the available 
analytic and computational tools. By employing a different set of variables, based on the path-set 
spectrum, we were able to arrive at a simpler system of two polynomials whose degrees are 10 and 
8. To finesse the construction, we use algebraic geometry combined with Maple. Specifically, we 
now compute the resultant of the two polynomials, which yields a single, degree 11 polynomial in 
one variable. The root(s) of this polynomial yield the desired ML solution. We rema rk that similar 



resul ts to the ones shown here, were obtained by Hosten, Khetan and Sturmfels ([Hosten et al 
2004), however by using somewhat different techniques. 



It is evident that the currently available heuristic methods, fail to predict the correct tree even 
for small number of taxa. This is true not only in the presence of multiple ML points, but also 
in cases where the neighborhood of the (single) ML point is relatively flat. Therefore, a practical 
consequence of this work is the use of rooted triplets in supertree methods (constructing a big 
tree from small subtrees). For unrooted trees, the subtrees must have at least four leaves (e.g. 
the tree from quartets pr oblem). For rooted trees, it is sufficient to amalgamate a set of rooted 
triplets ( Ah o et all 119811 ) . Our work enables such approaches to rely on rooted ML triplets based 
on four characters states rather than two. 



Additionally, analytic solutions are capable to reveal properties of the maximum likelihood points 
that are not obtainable numerically. For small trees, our result can serve as a method for checking 
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the accuracy of the heuristic methods used in practice. 

The remainder of this work is organized as following: In the next section we provide definitions 
and notations used throughout the rest of this work. In Section we develop the identities and 
structures induced by the Jukes-Cantor model, while Section |I] develops maximum likelihood on 
phylogenetic trees and subsequently solves the system for the special case of three species under 
Jukes-Cantor and molecular clock. In Section El we show experimental results of applying our 
method on real genomic sequences, while Section |H1 concludes with three open problems. 



2 Definitions, Notations, and the Hadamard Conjugation 

In this section we define the model of substitution we use, introduce useful notations, and briefly 
describe the Hadamard conjugation for the Kimura models of substitutions. 

2.1 Definitions and Notations 

We start with a tree labelling notation that will be useful for the rest of the work. We illustrate 
it for n = 4 taxa, but the definitions extend to any n. A split of the species is any partition 
of {1,2,3,4} into two disjoint subsets. We will identify each split by the subset which does not 
contain 4 (in general, n), so that for example the split {{1, 2}, {3, 4}} is identified by the subset 
{1, 2}. For brevity, to label objects in a split, we concatenate the members of the split. Each edge 
e of a phylogenetic tree T induces a split of the taxa, which is the cut induced by removing e. We 
denote the edge e by the cut it induces. For instance the central edge of the tree T = (12) (34) 
induces the split {{1, 2}, {3, 4}}, that is identified by the subset {1,2} and therefore this edge is 
denoted e\2- Thus the set of edges of T is E(T) = {ei, e2, ei2, e%, 6123} (see Figure EJ. 




Figure 2: The tree T = (12) (34) and its edges 

Throughout the paper, we will index our vectors and matrices by a method denoted subsets in- 
dexing. We encode a subset of {1,2,... , n} in an (n)-long binary number where the ith least 
significant bit (i = l,...,n) is "1" if i is in the subset, and "0" otherwise. Using this repre- 
sentation, it is convenient to index the rows and columns of a matrix by subsets of {1, 2, . . . , n} 
in a lexicographically increasing order (i.e. 0, {1}, {2}, {1, 2}, . . . , {1, 2, . . . , n}). Table ^ illus- 
trates a matrix M indexed by subsets indexing over the set {1,2}. The general element of M 
corresponding to subsets D and E, is denoted be mu,E- 

Extending the alphabet from two to four character states significantly increases th e complexity 
of handling the data. In contrast to the binary treated in previous works (|Yand . 2000; 
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Chor et all l200ll . 1200.4 Ichor and Snhll2004h . where each site pattern in the sequence data induced 
a split, the four state site patterns induce a pair of splits. We will use the term substitution 
pattern to represent the substitutions to each taxon from a reference taxon. Let X = {1, 2, • • • , n} 
represent the set of taxa under study. We select n as a reference taxon and let X = X — {re} the 
set of non-referenced taxa. Consider a re-dimensional vector v over the DNA alphabet, where each 
entry i correspond to a taxon i in X. The vector v is called a character pattern. A substitution 
pattern is a (n — 1) -dimensional vector of the substitution types v n ~~» i>\ for i € X. 



For example, the character pattern 



A 
C 
T 
T 



induces the substitution pattern 



T ~» A 
T 

T T 



Suppose a phylogenetic tree T over the set of taxa X is given, with substitution probabilities on 
each of its edges. Then, the probability of obtaining each substitution pattern is well defined. We 
remark that the number of substitution pattern is S x S n_1 = S n . For some popular models, the 
set of substitutions is substantially smaller than the general case. 



The Kimura 3 substitution model ( Kimural . 1983^ . is a model of symmetric nucleotide substitu- 
tions, implying convergence to equal base frequencies. In that model, Kimura proposed 3 classes 
of substitution: transitions (denoted a, A G, T C), type I transversions (denoted /3, 
A T, G C) and type II transversions (denoted 7, A C, T G). Figure El illustrates 
these relations. We denote each of the substitution types with a pair of binary numbers: t a = ioi 
for transitions, t» = iiOj t~ = t\\ for transversions and we write t e = too f° r n ° substitution. 



The number of substitution patterns with this coding is 4 n (for every taxon, the substitution 
- * Vi is either of type t a , tp, i 7 or t £ ). 

We now define two subsets, D,E C X, as follows: D = {i:v n — * Vi G {tf3,t 7 }} and E = {i:v n — > 
v% € {t a ,tj}}. Since both D and E contain species with substitution type t 1 , they are not disjoint. 
To better understand this classification of the species into the sets D and E, we define an encoding 
of the character states as follows: 



.4 
C 
G 
T 



(1,0) 

(0,1) 

(1,1) 

(0,0) 





{} 


{1} 


{2} 


{1,2} 




00 


01 


10 


11 


{} 00 










{1}01 










{2} 10 








m 2 ,i2 


{1,2} 11 











Table 1: The matrix M indexed by split indexing. The element ?re{2},{i2} is placed in the (2,3) 
(binary (10, 11)) entry. 
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(a) 



(b) 



U(T) 



a 



(0,0) 



(0,1) 





a 



(1,1) 



Figure 3: (a) Kimura's 3-substitution model (K3ST). (b) Substitution types t a = toi, tp = Ho-, 
t-y = tn and t e = too- 



With this mapping, D contains the species i such that the first bit that encodes the state of i 
differs from the first bit that encodes the state of the reference species, n. The set E contains 
all species i such that the second bit that encodes the state of i differs from the second bit that 
encodes the state of species n. For example, suppose the character pattern v is as follows: 



species {%) state (yi) binary encoding substitution membership in D membership in E 



1 

2 
3 
4 



A 
C 
G 
T 



(1,0) 
(0,1) 

(1,1) 

(0,0) 



tp 

tat 



(1,0) 
(0,1) 

(1,1) 



1 



1 





1 
1 



We can view the set D (resp. E) as a split {D,X \ D} (resp. {E, X \ E}). We encode every 
substitutions pattern by the two ordered splits (D, E) that define it. Let sd,e be the probability 
of obtaining the substitution pattern (D, E) on a tree. Both D and E range over all subsets of 
X. Therefore it is natural to represent all probabilities sd,e i n a matrix S = indexed by 

subsets indexing over X x X. The rows are indexed by the split D and the columns by the split 
E. We call the matrix S the expected sequence spectrum. Since the number of splits over X is 
2 n ~\ S is a x 2 n ~ 1 matrix. 



For an edge e, let q e (ot), q e (P) & n d q e (l) be the expected number of substitutions of type a, 0, 
and 7, respectively. We call them the edge length parameters, so each edge is associated with 
three different "lengths", one per substitution type. Tree edges naturally correspond to splits. 
We extend the notion of edge lengths to splits that do not correspond to tree edges, by simply 
defining the length as zero: For a subset DCI such that D ^ and D is not an edge split, we 
set q D (0) =0,(0 e {a, A7»- For D = 0, we set q % (6) = -K(0) where K(6) is the sum of all 
other <7d(0) values. We define three vectors q e for 6 = a, /3, 7 indexed by subsets indexing over X 
as follows: q# = [qD(0)\D C X] . Then qd(#) = implies there is no edge en in T (e.g. qi3(#), 
q23(#) in T12). Figure Ufa) shows the edge length spectra for the tree T12 on n = 4 taxa that was 
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" -K(a) ' 




" ~K(J3) ' 




" -^(7) " 


qi(a) 




91 (P) 




9l(7) 


92(a) 




92(/3) 




92(7) 


912(a) 




912 (/?) 




912(7) 


©(a) 


1 Q/3 — 


93(/3) 




93(7) 




























. 9123(a) _ 




. 9123 (0) _ 




. 9123(7) . 



(a) 



Qi 



-K qi(a) q 2 (a) 912(a) 93(a) 9123(a) 
Ql(0) 9i(7) • • • • 

92(7) 

912(7) 

93(7) • 



<h(P) 

912 (P) 

gsiP) 



9123(0) 







• • 9123(7) 



(b) 

Figure 4: (a): Example edge length spectra for the tree T±2- (b): Q = Qt 12 



illustrated in Figure [21 

We will find it convenient to put these three vectors into a matrix Q(= Qt) = [iD e] of 2 n_1 
rows and columns indexed by subsets indexing over X x X, with = —(K(a) + K(/3) + if (7)), 
and the remaining entries of q a , q^ and q 7 becoming the leading row, column and main diagonal 
of Q respectively. All other entries of Q are set to 0. Figure GJb) shows the matrix Q = Qti 2 
holding the vectors q Q , q^, q 7 from Figure 0^a). This means that for D,E C {1,2,3}, Q$ : d = 
9o(a)) Qd,<& = 9d(/?), Qd,d = 9d(7) j and for all other entries, Qd,e = 0, except the first entry 
Q0 = —{K(a) + K(j3) + K{^)). The entries indicated by "•" are all zero, and are zero for every 
tree. The entries indicated by "0" are zero for this specific tree T12, but for different trees can be 
non-zero. The non-zero entries (in the leading row, column and main diagonal) should each be 
in the same component, and these identify the edge splits of T. For general trees on n taxa, the 
edge length spectra are vectors and square matrices of order 2 n_1 . 



2.2 Hadamard Conjugation 



The Hadamard conjugation (|Hendv and Penny . ll99.4lHenHv et a,llll994h is an invertible trans- 
formation that specifies a relation between the expected sequence spectrum S and the edge lengths 
spectra q(0) of the tree. In other words, the transformation links the probabilities of site substitu- 
tions on edges of an evolutionary tree T to the probabilities of obtaining each possible substitutions 
pattern. The Hadamard conjugation is applicable to a number of site substitution models: Ney- 
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2 s tate model, Ju kes-Cantor model ( Jukes and Cantor! 19691 ) . and Kimura 2ST and 3ST 
models ( Kimural . 1983h (the last three models correspond to four states characters, such as DNA 
or RNA). For these models, the transformation yields a powerful tool which greatly simplifies and 
unifies the analysis of phylogenetic data, and in particular the analytical approach to ML. 



Definition 1 A Hadamard matrix of order £ is an I x I matrix A with ±1 entries such that 
A t A = ih. 



We will use a special family of Hadamard matrices, called Sylvester matrices in Mac Williams and 

H n H n 



Sloan (1977, p. 45), defined inductively for n > by Hq = [1] and H n+ \ 
example, 



For 



Hi 



1 1 
1 -1 



and H2 



1111 
1-1 1-1 
1 1-1-1 
1-1-1 1 



H n is indexed by subsets indexing over {1, . . . , n} x {1, . . . , n}. Let hn e be the general element 
of H n . Then: 



Observation 1 ho,E = (— 1) 



\DnE\ 



This implies that H n is symmetric, namely H l n = H n , and thus by the definition of Hadamard 
matrices H~ x = ^H n . 



Proposition 1 (Hendy and Penny 1993) Let T be a phylogenetic tree on n leaves with finite 
edge lengths (q e (6) < 00 for all e £ E(T) and 6 £ {a, f3, j}). Assume that sites mutate according 
to a symmetric substitution model, with equal rates across sites. Let S be the expected sequence 
spectrum and Q the edge length spectrum as was described above. Then 



S = S(Q) = H-^eMHn-iQ) 



(1) 



where the exponentiation function exp{x) = e x is applied element wise to the matrix R = H n -iQ. 



This transformation is called the Hadamard conjugation. 



Definition 2 A matrix S € M 2 " 1 x M 2 " 1 satisfying Y^d ec{i n-i} &d,e = 1 an d H n -\S > 
is called conservative. 



For conservative data S, the Hadamard conjugation is invertible, yielding : 

Q = Q(S)= J ff-i 1 ln(F n _ 1 S) 
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site 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 





11 


00 


00 


11 


01 


00 


01 


10 


10 


11 


00 


01 


01 


10 


00 


10 


Cl = 


C 


C 


A 


T 


C 


A 


A 


A 


c 


G 


T 


G 


T 


G 


A 


c 




no 


00 


00 


00 


01 


00 


01 


00 


00 


00 


00 


00 


01 


00 


01 


10 


02 = 


A 


c 


A 


G 


c 


A 


A 


T 


G 


T 


T 


A 


T 


c 


T 


c 




11 


00 


00 


11 


00 


01 


01 


10 


00 


10 


00 


01 


00 


10 


01 


11 


0"3 = 


C 


c 


A 


T 


T 


G 


A 


A 


G 


A 


T 


G 


C 


G 


T 


T 


04 = 


A 


c 


A 


G 


T 


A 


G 


T 


G 


T 


T 


A 


C 


c 


A 


G 



a 



F = 



3 








2 


1 


1 


1 


1 


1 































































































2 


1 











2 











































b 


1 












Table 2: (a):Four aligned sequences with sixteen sites, (b): The corresponding observed sequence 
spectrum 

where the In function is applied element-wise to the matrix H n ^\S. We note that Q is not 
necessarily the edge length spectrum of any tree. On the other hand, the expected sequence 
spectrum of any tree T is always conservative. 

Consider now a set of n aligned homologous sequences a%, ■ ■ ■ , cr n , and denote this alignment as 
AL . We can view AL as a table where each column in this table induces a substitution pattern. 
Let Fd^e be the frequency of the substitution pattern represented by the splits (D,E). The 
matrix F = [Fjj^] is denoted as the observed sequence spectrum and is indexed analogously to 
the expected sequence spectrum matrix, S (that is, by subset indexing over X x X). 

Table |2fa) illustrates four sample DNA sequences with sixteen sites. 04 is the reference sequence, 
the pair of binary digits above each character of o\, ■ ■ ■ , 03 is the substitution type to derive that 
character from the homologous character of 04. For example, the entry 11 above G at site 10 of o\ 
indicates that the substitution to this nucleotide from the corresponding T of the reference sequence 
(T4 is of type i 7 . In (b), the frequencies of each of the site patterns from (a) are summarized in 
the observed sequence spectrum F. The rows of F are indexed by the first triple of the binary 
pairs, and the columns by the second, in the order 000,001,010,011,100,101,110,111. The site 
pattern of site 10 is represented by the pair (101,001) (or D = {1,3}, E = {1} alternatively) so 
the entry corresponding to this is in row 101 and column 001 of F. As this pattern occurs only at 
site 10, the entry in row 101 and column 001 of F is 1 (highlighted in bold font). We emphasize 
that the examples here refer to a tree on four leaves. The trees we solve for in the next sections 
have only three leaves. 



S 



3 Jukes— Cantor model for 3 sequences 



The Jukes-Cantor model of evolution ( Jukes and Cantor , 19691 ) is the simplest model for four 
states DNA evolution. The assumption in this model is that when a base changes, it has equal 
probabilities to change to each of the other three bases. This model can be derived from the 
more general Kimura 3— ST model by setting, for each edge of T, each of the three edge length 
parameters equal to a common value, namely setting q e (a) = q e (f3) = q e {l) = 9e- We now look on 
the tree T on three taxa {0, 1,2} before determining where the root is. T has just one topology, 
the star with the three edges e±, e 2 and e\ 2 . For convenience we will write the edge length of e\ 2 
as q 3 . 

We now define several auxiliary matrices that will be useful in the sequel: 



H 



1 


1 




1 


1 " 








" 1 


1 


1 


1 " 


1 


-1 




1 - 


-1 




,J 




1 


1 


1 


1 
























1 


1 




-1 - 


-1 






1 


1 


1 


1 


1 


-1 




-1 


1 . 








. 1 


1 


1 


1 . 


" 





1 


" 








' 








1 " 
















,A 


















,L 






















1 





1 











































1 








1 





10 









,A 1 



J-Aq-At-Ao- A 3 



The following identities relating these seven matrices, hold: 

HJH = 16 A , 

HA H = J, 
HA X H = 4(Ao + A 2 ) - J, 
HA 2 H = 4(Ao + Ai) - J, 
HA 3 H = 4(Ao + AO - J- 



' 


1 





" 


1 


1 




















. 








. 


" 








" 








1 


1 





1 





1 





1 


1 






(2) 

(3) 
(4) 
(5) 
(6) 



The edge-length spectrum of an arbitrary 3-tree can be expressed as the 4x4 matrix, 



Q 



-3(qi + q 2 + qs) qi <?2 93 

qi qi 

q 2 q 2 

q 3 q 3 



qiAx + q 2 A 2 + q 3 A 3 - 3(qi + q 2 + q 3 )A . 



Now, from Equations I2HH1 we see 

HQH = -4[( qi + q 3 )A x + (q 2 + q 3 )A 2 + {q x + q 2 )A 3 + {q x +q 2 + q 3 )L] 



qi + q 3 q 2 + q 3 qi + q 2 

qi + 93 qi + <?3 qi + 92 + 93 qi + 92 + 93 

92 + 93 91 + 92 + 93 92 + 93 9i + 92 + 93 

. 91 + 92 9i + 92 + 93 9i + 92 + 93 9i + 92 
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so applying the exponential function to each element of the matrix HQH we obtain the so called 
path-set spectrum, R: 



R 



where 



exp{HQH) 

A + X1X3A1 + x 2 x 3 A 2 + xix 2 A 3 + x\x 2 x 3 L 

1 X1X3 X 2 X 3 X\X 2 

X\X 3 X1X3 X\X 2 X 3 X\X 2 X 3 

x 2 x 3 XiX 2 X 3 x 2 x 3 XlX 2 X 3 

X\X 2 X\X 2 X 3 X\X 2 X 3 X\X 2 



(7) 



(8) 



The Xi values can replace the values as the defining parameters and are called the path set 
variables. The entries of R relate to the probabilities of differences between the end-points of 
paths in T. 



By using Proposition Q the expected sequence spectrum equals 



S 



H^RH- 1 

— [(1 + 3xix 2 + 3xix 3 -I- 3x 2 x 3 + 6x1X2X3)^0 
lb 

+(1 - xix 2 - X!X 3 + 3x 2 x 3 - 2x 1 x 2 x 3 ),4 1 
+(1 - xix 2 + 3xix 3 - x 2 x 3 - 2x1X2X3)^2 
+(1 + 3xix 2 - X1X3 - x 2 x 3 - 2x1X2X3)^3 
+(1 - xix 2 - X1X3 - x 2 x 3 -I- 2xix 2 x 3 )L] 
a Q 01 a 2 a 3 



(9) 



1 

16 



01 
a 3 



Ol 

0,4 



a 1 

a 2 

0,4 



a 3 



(10) 



where 



Of) 

a 2 

03 
«i 



(1 + 3xix 2 + 3xix 3 + 3x 2 x 3 + 6x1X2X3) 
(1 - x x x 2 - X1X3 + 3x 2 x 3 - 2xix 2 x 3 ), 
(1 - xix 2 + 3xix 3 - x 2 x 3 - 2xix 2 x 3 ), 
(1 + 3xix 2 - X1X3 - x 2 x 3 - 2xix 2 x 3 ), 
(1 - xix 2 - X1X3 - x 2 x 3 + 2xix 2 x 3 ). 



:n) 



Thus we see that each expected sequence frequency takes one of the above values, which are 
functions of the three parameters xi, X2 and X3. 



4 Obtaining the Maximum Likelihood Solution 



Given the observed frequencies, Fd,Ei of each site pattern (D,E) C X x X (normalised so that 
J2d ecx Fd,e = 1), then for any expected sequence spectrum S of some tree T, the likelihood of 
obtaining those normalised frequencies is 

L(F\T,S)= J] S F D D £. (12) 

D,ECX 



10 



1 2 



3 



Figure 5: A triplet tree under the molecular clock satisfies q\ = (fe- 

(It is convenient to use normalised frequencies as it simplifies the formulae later. This normal- 
isation scales log L by a constant factor, so does not affect the identity of the turning points.) 
Equation (10) gives identities among the pattern probabilities Sd,e so grouping the common 
factors in equation (12) gives 

4 

Ji 



where 



L(F\T,S) = l[, 



(13) 



3=0 



fo 

h 
h 

h 
h 



Ft 



'0,0' 



^0,1 + ^1,0 + ^1,1, 
-^0,2 + -^2,0 + ^2,2, 
-^0,12 + ^12,0 + -^12,12, 

-^1,2 + ^1,12 + ^2,1 + -^2,12 + -^12,1 + -^12,2- 



The expected sequence spectrum S can be expressed as a function of the three variables xi, X2 and 
X3, so the values which maximise the likelihood L are obt ained when t he th r ee partial derivatives , 



dL 

dx.i ' 



(i = 1, 2, 3), are ze ro. In contrast to previous works ()Chor et al 



2001 



Chor and Snir 
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4: 



1 ' ' [ I ' I I -1 * ' ' 

Chor et all 2000, 2003) that operated in the space of the expected sequence variables, Sd E, here 
we are operating in the space of the path-set variables. This eliminates the need to introduce the 
constraint of the ML points being on a "tree surface". By the chain rule, we get: 



= L-T^-^ = 0, fori = 1,2,3. 

rtnr ■ < n ■ rim ■ 



dxi ^— ' a,- dxi 

3=0 J 



(14) 



We require our ML tree to adhere to the molecular clock assumption, so a ((1, 2), 3)— triplet tree 
under this assumption requires q\ = q2 < q 3 (see Figure [SJ) which implies x\ = X2 > x 3 . In our 
analysis below we will explicitly impose the equality to find the turning points. The inequality 
will need to be tested on any potential solution, and if it were not satisfied, a maximum could be 
sought on the boundary of the valid tree domain, where x\ = X2 = X3. 

The constraint x\ = X2 implies a\ = 02, so by setting f'12 = fi + h and a\2 = a\ = 02 we 
reduce the complexity of equation (|14[) to give two rational equations in two free variables and 
the parameters fj: 

dL ( f da fudau f 3 da 3 fida A \ . 

oxi \ao oxi a\2 oxi a 3 oxi oxi J 



11 



These simultaneously vanish when the two numerators, which are polynomials in X2 , X3 and the 
parameters fj, are both zero. We refer to these polynomial equations as E\ and E^. 



We now show that the system of two resulting polynomials {E\,Ei} has only finitely many 
solutions, all of which we can find. The major tool used here is the resultant of two polynomials. 



Let f(x) = Y^i=o aiX% an d 9( x ) = Sj=o ^3 X ^ ^ e * wo polynomials in one variable, x. The resultant 
of / and g, denoted Res(f,g,x), is a polynomial in the coefficients ctj and bj of / and g, which 
is whenever / and g have a common zero. The coefficients can themselves be unknowns, or 
functions of other variables, in which case the resultant replaces the two polynomials / and g by 
a single polynomial in one fewer variable. 

Computing the resultant is a classical technique for eliminating one variable from two equations. 
There is an elegant formula for computing it due to Sylvester, and another due to Bezout, which 
have been implemented in most computer algebra packages, such as Maple. 

We can compute the resultant ER = Res(E\, E2, x%) of E% and E2 with respect to X3. This 
eliminates X3 from the equations and yields a single polynomial ER, in just X2 and the parameters. 
The polynomial ER has the form: 

ER = kf 3 f 12 f xff 4 (3^2 + 1) {2x 2 2 + x 2 + 1) (3z| + l) (3x^ + 3 x 2 + 2) 

(X2 - I)' (X2 + if ■ Po (16) 

where Po is a degree 11 polynomial with 288 monomials and k is some big constant. 

Theorem 1 The turning points of L ( equation corresponding to realistic trees ( namely, trees with 
positive edge lengths ) are exactly the roots of Po . 

Proof. The only term in ER except for Po (Equation I16[) that admits positive real roots is the term 
(x2 — 1). However, by the definition of X2, this root corresponds to q 2 — which is not a realistic tree. I 



Corollary 2 The Jukes- Cantor triplet has a finite number of ML points. 



Proof. Po has at most 11 different solutions and for each such a solution we back substitute to obtain 
all the values of X3 . ■ 



5 Results on Genomic Sequences 



In order to evaluate our method, we tested it on real genomic sequences. We looked at the NK cell receptor 
D gene on human, mouse and rat (accession n umbers AF26013 5 , AF0 30313 and AF009511 respectively). 
We aligned the sequences using CLUSTALW ^Thompson et af . 1994). Next, we computed the observed 



sequence spectrum, as explained in Section and illustrated in Table Three sequences have 16 site pat- 
terns and therefore the observed sequence spectrum is written in a 4-by-4 matrix. The resulting spectrum 
is shown in Table 

We calculated the maximum likelihood value for each of the three rooted trees under the model for the 
three species. As expected the ((rat, mouse), human) tree was maximal, with edge lengths qi = q 2 = 0.0197 
to rat and mouse and (73 = 0.1061 to human, giving the log likelihood InL = —870.2. 
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pattern 
frequency 


00 


01 


10 


11 


00 


424 


18 


18 


80 


01 


1 


7 


2 


2 


10 


7 


4 


4 


4 


11 


27 


1 


2 


40 



Table 3: The observed sequence spectrum of NK cell receptor D gene of human, mouse and rat 

We also calculated the maximum likelihood value for each of the three rooted trees for the beta actin gene 
for the three species guinea pig, goose and C elegans,(acc. numbers AF508792, M26111 and NM_076440 
resp.) finding the ((guinea pig, goose), C elegans) tree maximal, with qi = q 2 = 0.021819 and q% = 0.050188 
giving InL = —1241.5. Finally we calculated the maximum likelihood value for each of the three rooted 
trees for the histone gene of Drosophila melangoster, Hydra vulgaris and Human (acc numbers AY383571, 
AY383572 and NM_002107 resp.) finding the ((D. melangoster, H. vulgaris), Human) tree maximal, with 
qi=q 2 = 0.001555 and q 3 = 0.012740 with InL = -86.835133. 



Each of the results above agree clos ely with the nume rical values o btained using t he popular phyloge- 
netic reconstruction packages Phylip ( Fe lsenstei rl llQSQh and PAUP* llSwoffordl Fflflfft which use iterative 
methods to estimate the maxima. 



6 Directions for Future Research 



The progress made here brings up a number of open problems: 



• Our ML solutions are derived from the roots of a univariate, degree 11 polynomial. This implies that 
the number of ML solutions is finite. It would be interesting to explore the question of uniqueness 
of the solution. If this is the case, it will most likely fol low from the existence of a single solution 

20031). 



corresponding to a realistic tree, as in IjChor et al 



The Jukes-Cantor substitution model is the a special case of the family of Kimura substitution 
models. It would be interesting to further extend the result in this paper for the other models (two 
and three parameters) of the Kimura family. 

It would be interesting to extend these results to rooted trees with four leaves un der JC model and a 



molecular clock. Here we have two different topologies - the fork and the comb (iChor et al 
It is expected that such extension will face substantial technical difficulties. 



20031) . 
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